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Abstract 

There has been significant recent work on the theory and application of randomized coordinate descent 
algorithms, beginning with the work of Nesterov] [SIAM J. Optim., 22(2), 2012], who showed that a 
random-coordinate selection rule achieves the same convergence rate as the Gauss-Southwell selection 
rule. This result suggests that we should never use the Gauss-Southwell rule, because it is typically 
much more expensive than random selection. However, the empirical behaviours of these algorithms 
contradict this theoretical result: in applications where the computational costs of the selection rules 
are comparable, the Gauss-Southwell selection rule tends to perform substantially better than random 
coordinate selection. We give a simple analysis of the Gauss-Southwell rule showing that—except in 
extreme cases—its convergence rate is faster than choosing random coordinates. We also (i) show that 
exact coordinate optimization improves the convergence rate for certain sparse problems, (ii) propose a 
Gauss-Southwell-Lipschitz rule that gives an even faster convergence rate given knowledge of the Lipschitz 
constants of the partial derivatives, (iii) analyze the effect of approximate Gauss-Southwell rules, and 
(iv) analyze proximal-gradient variants of the Gauss-Southwell rule. 


1 Coordinate Descent Methods 


There has been substantial recent interest in applying coordinate descent methods to solve large-scale op¬ 
timization problems, starting with the seminal work of Nesterov 2012 , who gave the first global rate-of- 


convergence analysis for coordinate-descent methods for minimizing convex functions. This analysis suggests 
that choosing a random coordinate to update gives the same performance as choosing the “best” coordi¬ 
nate to update via the more expensive Gauss-Southwell (GS) rule. (Nesterov also proposed a more clever 
randomized scheme, which we consider later in this paper.) This result gives a compelling argument to use 
randomized coordinate descent in contexts where the GS rule is too expensive. It also suggests that there 
is no benefit to using the GS rule in contexts where it is relatively cheap. But in these contexts, the GS 
rule often substantially outperforms randomized coordinate selection in practice. This suggests that either 
the analysis of GS is not tight, or that there exists a class of functions for which the GS rule is as slow as 
randomized coordinate descent. 

After discussing contexts in which it makes sense to use coordinate descent and the GS rule, we answer 
this theoretical question by giving a tighter analysis of the GS rule (under strong-convexity and standard 
smoothness assumptions) that yields the same rate as the randomized method for a restricted class of 
functions, but is otherwise faster (and in some cases substantially faster). We further show that, compared 
to the usual constant step-size update of the coordinate, the GS method with exact coordinate optimization 
has a provably faster rate for problems satisfying a certain sparsity constraint (Section]^. We believe that 
this is the first result showing a theoretical benefit of exact coordinate optimization; all previous analyses 
show that these strategies obtain the same rate as constant step-size updates, even though exact optimization 
tends to be faster in practice. Furthermore, in Section]^ we propose a variant of the GS rule that, similar 
to Nesterov’s more clever randomized sampling scheme, uses knowledge of the Lipschitz constants of the 
coordinate-wise gradients to obtain a faster rate. We also analyze approximate GS rules (Section]^, which 
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provide an intermediate strategy between randomized methods and the exact GS rule. Finally, we analyze 
proximal-gradient variants of the GS rule (Section for optimizing problems that include a separable non¬ 
smooth term. 


2 Problems of Interest 

The rates of Nesterov show that coordinate descent can be faster than gradient descent in cases where, if we 
are optimizing n variables, the cost of performing n coordinate updates is similar to the cost of performing 
one full gradient iteration. This essentially means that coordinate descent methods are useful for minimizing 
convex functions that can be expressed in one of the following two forms: 


^ 1 ( 2 ;) := X !+ f{Ax), 
2=1 


h2ix) := 

iev 




fijixi, Xj), 


where Xi is element i oi x, f is smooth and cheap, the are smooth, G = {V,E} is a graph, and A is a 
matrix. (It is assumed that all functions are convex.The family of functions hi includes core machine¬ 
learning problems such as least squares, logistic regression, lasso, and SVMs (when solved in dual form) [Hsieh 


et al. 2008|. Family ^2 includes quadratic functions, graph-based label propagation algorithms for semi- 


graphical models Rue and Held, 2005 


supervised learning Bengio et al. 2006 , and finding the most likely assignments in continuous pairwise 


In general, the GS rule for problem /12 is as expensive as a full gradient evaluation. However, the structure 
of G often allows efficient implementation of the GS rule. For example, if each node has at most d neighbours, 
we can track the gradients of all the variables and use a max-heap structure to implement the GS rule in 
0{d\ogn) time Meshi et al. 2012 . This is similar to the cost of the randomized algorithm if d « \E\/n (since 


the average cost of the randomized method depends on the average degree). This condition is true in a variety 
of applications. For example, in spatial statistics we often use two-dimensional grid-structured graphs, where 
the maximum degree is four and the average degree is slightly less than 4. As another example, for applying 
graph-based label propagation on the Facebook graph (to detect the spread of diseases, for example), the 
average number of friends is around 200 but no user has more than seven thousand friendsj^ The maximum 
number of friends would be even smaller if we removed edges based on proximity. A non-sparse example 
where GS is efficient is complete graphs, since here the average degree and maximum degree are both (n—1). 
Thus, the GS rule is efficient for optimizing dense quadratic functions. On the other hand, GS could be very 
inefficient for star graphs. 

If each column of A has at most c non-zeroes and each row has at most r non-zeroes, then for many 
notable instances of problem hi we can implement the GS rule in 0(cr log n) time by maintaining Ax as well 
as the gradient and again using a max-heap (see Appen dix [A|) . Thus, GS will be efficient if cr is similar to 
the number of non-zeroes in A divided by n. Otherwise, Dhillon et al. 2011 show that we can approximate 
the GS rule for problem hi with no gi functions by solving a nearest-neighbour problem. Their analysis 
of the GS rule in the convex case, however, gives the same convergence rate that is obtained by random 


selection (although the constant factor can be smaller by a factor of up to n). More recently, Shrivastava 
and Li [2014| give a general method for approximating the GS rule for problem hi with no gt functions by 
writing it as a maximum inner-product search problem. 


3 Existing Analysis 

We are interested in solving the convex optimization problem 


min fix), 
xGR" ^ 


(X) 


^We could also consider slightly more general cases like functions that are defined on hyper-edges 
, provided that we can still perform n coordinate updates for a similar cost to one gradient evaluation, 
https://recordsetter.com/world-record/facebook-friends 


Richtarik and Takac 
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where V/ is coordinate-wise L-Lipschitz continuous, i.e., for each i = 1,... ,n, 

|Vi/(a; -I- aci) — Vi/(a;)| < L\a\, Vx S K" and a G K, 

where is a vector with a one in position i and zero in all other positions. For twice-differentiable functions, 
this is equivalent to the assumption that the diagonal elements of the Hessian are bounded in magnitude 
by L. In contrast, the typical assumption used for gradient methods is that V/ is L^-Lipschitz continuous 
(note that L < < Ln). The coordinate-descent method with constant step-size is based on the iteration 


The randomized coordinate-selection rule chooses ij- uniformly from the set {1,2,..., n}. Alternatively, the 
GS rule 

ik = argmax |Vi/(x'")|, 

i 

chooses the coordinate with the largest directional derivative. Under either rule, because / is coordinate-wise 
Lipschitz continuous, we obtain the following bound on the progress made by each iteration: 

/(x"+i) < + V, J(x'=)(x^+i - + |(x'=+i - x^)l 


= /(x'=)-l(V.,/(x'=))2 + | 
= /(^'=)-^[V../(x'=)]^ 




We focus on the case where / is /r-strongly convex, meaning that, for some positive /i. 


( 2 ) 


/(y) > fix) + (V/(x),y-x) -b ||ly-xf, Vx,y G 


( 3 ) 


which implies that 


fix*)>fix'^)-^\\^fix'^)r, 


( 4 ) 


where x* is the optimal solution of Q . This bound is obtained by minimizing both sides of ([^ with respect 
to y. 


3.1 Randomized Coordinate Descent 

Conditioning on the cr-field J-k-i generated by the sequence {x°,x^,... ,x^“^}, and taking expectations of 
both sides of ([^, when ik is chosen with uniform sampling we obtain 


E[/(x'=+^)] < E 




n 

Z=1 




Using 0 and subtracting fix*) from both sides, we get 

E[/(x'=+i)] - fix*) < (l - £) [fi^Y - fix*)]- 


This is a special of case of Nesterov 2012 Theorem 2] with a = 0 in his notation. 


( 5 ) 
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3.2 Gauss-Southwell 

We now consider the progress implied by the GS rule. By the definition of ik, 




( 6 ) 


Applying this inequality to ([^, we obtain 




which together with 0. implies that 

f{x'^+^) - fix*) < (l - £) [/(^") - /(^*)]- 

This is a special case of [Boyd and Vandenberghe 2004 §9.4.3], viewing the GS rule as performing steepest 
descent in the 1-norm. While this is faster than known rates for cyclic coordinate selection [Beck and' 


(7) 


Tetruashvili 2013 and holds deterministically rather than in expectation, this rate is the same as the 
randomized rate given in ([^. 


4 Refined Gauss-Southwell Analysis 

The deficiency of the existing GS analysis is that too much is lost when we use the inequality in Q. To 
avoid the need to use this inequality, we instead measure strong-convexity in the 1-norm, i.e., 


f{y) > fix) + {Vfix),y-x) + ^\\y-x\\l, 


which is the analogue of (|^. Minimizing both sides with respect to y, we obtain 

fix*) > /(a:) -sup{(-V/(a;),y-a;) - ^|jy-a;||?} 

= fix)-{^\\-\\l)*i-^fix)) (8) 

= fix) - ;^l|V/(a;)||^, 

2yi 


Boyd and Vandenberghe 


which makes use of the convex conjugate (^|| • jjf)* = 

Using Q in ([^, and the fact that i*^ik fix^))'^ = ||V/(a;^)||^ for the GS rule, we obtain 


1 

2^1 I 


2004 


§3.3]. 


fix^+^) - fix*) < (l - y) [fix’^) - fix*)]. 


(9) 


It is evident that if yi = /i/n, then the rates implied by ([^ and 0 are identical, but 0 is faster if 
yi > [ijn. In Appendix]^ we show that the relationship between y and can be obtained through the 
relationship between the squared norms jj • and jj • jjf. In particular, we have 


/i 

- < < y. 

n 


Thus, at one extreme the GS rule obtains the same rate as uniform selection (/ri « y/n). However, at the 
other extreme, it could be faster than uniform selection by a factor of n (/ri « fi). This analysis, that the 
GS rule only obtains the same bound as random selection in an extreme case, supports the better practical 
behaviour of GS. 
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4.1 Comparison for Separable Quadratic 

We illustrate these two extremes with the simple example of a quadratic function with a diagonal Hessian 
= diag(()Ai,..., A„). In this case, 


fi = min Xi, and p,i 

i 



We prove the correctness of this formula for /ri in Appendix]^ The parameter [jl\ achieves its lower bound 
when all Xi are equal, Ai = • • • = A„ = a > 0 , in which case 


/j, = a and p,i = ajn. 


Thus, uniform selection does as well as the GS rule if all elements of the gradient change at exactly the same 
rate. This is reasonable: under this condition, there is no apparent advantage in selecting the coordinate to 
update in a clever way. Intuitively, one might expect that the favourable case for the Gauss-Southwell rule 
would be where one Xi is much larger than the others. However, in this case, /ii is again similar to /r/n. To 
achieve the other extreme, suppose that Ai = /3 and A 2 = A 3 = • • • = A„ = a with a > (3. In this case, we 
have A* = /3 and 

I3a 

Q,n-1 _ l)^Q,n-2 Q, _|_ 

If we take a —>■ 00 , then we have fj-i —>■ /3, so fxi —>■ /r. This case is much less intuitive; GS is n times faster 
than random coordinate selection if one element of the gradient changes much more slowly than the others. 


4.2 ‘Working Together’ Interpretation 

In the separable quadratic case above, fXi is given by the harmonic mean of the eigenvalues of the Hessian 
divided by n. The harmonic mean is dominated by its smallest values, and this is why having one small 
value is a notable case. Furthermore, the harmonic mean divided by n has an interpretation in terms of 
processes ‘working together’ Ferger 1931 . If each Xi represents the time taken by each process to finish a 


task (e.g., large values of Xi correspond to slow workers), then /r is the time needed by the fastest worker 
to complete the task, and a*! is the time needed to complete the task if all processes work together (and 
have independent effects). Using this interpretation, the GS rule provides the most benefit over random 
selection when working together is not efficient, meaning that if the n processes work together, then the task 
is not solved much faster than if the fastest worker performed the task alone. This gives an interpretation of 
the non-intuitive scenario where GS provides the most benefit: if all workers have the same efficiency, then 
working together solves the problem n times faster. Similarly, if there is one slow worker (large A^), then the 
problem is solved roughly n times faster by working together. On the other hand, if most workers are slow 
(many large Xi), then working together has little benefit. 


4.3 Fast Convergence with Bias Term 

Consider the standard linear-prediction framework, 

ra » 

argmin^ /(afa; -b P) -b -\\xf -b 

x,P , ^ ^ 

1—1 

where we have included a bias variable /3 (an example of problem hi). Typically, the regularization parameter 
a of the bias variable is set to be much smaller than the regularization parameter A of the other covariates, 
to avoid biasing against a global shift in the predictor. Assuming that there is no hidden strong-convexity in 
the sum, this problem has the structure described in the previous section (yti « /r) where GS has the most 
benefit over random selection. 
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5 Rates with Different Lipschitz Constants 


Consider the more general scenario where we have a Lipschitz constant Li for the partial derivative of / with 
respect to each coordinate i, 

|Vi/(x + aei) — Vif{x)\ < Li\a\, Vx S K" and a G M, 


and we use a coordinate-dependent step-size at each iteration: 




= x’^ - 


By the logic of in this setting we have 

- 

and thus a convergence rate of 


2L,; 




Noting that L = maxijLi}, we have 

k 

n 


n 

i=i 


1 - 


L,, 




1 - 


L,-- 




( 10 ) 


( 11 ) 


( 12 ) 


(13) 


Thus, the convergence rate based on the Li will be faster, provided that at least one iteration chooses an i^. 
with Li^, < L. In the worst case, however, (131 holds with equality even if the Li are distinct, as we might 
need to update a coordinate with Li = L on every iteration. (For example, consider a separable function 
where all but one coordinate is initialized at its optimal value, and the remaining coordinate has Li = L.) 
In Section]^ we discuss selection rules that incorporate the Li to achieve faster rates whenever the Li are 
distinct, but first we consider the effect of exact coordinate optimization on the choice of the 


5.1 Gauss-Southwell with Exact Optimization 


For problems involving functions of the form hi and h 2 , we are often able to perform exact (or numerically 
very precise) coordinate optimization, even if the objective function is not quadratic (e.g., by using a line- 
search or a closed-form update). Note that (121 still holds when using exact coordinate optimization rather 
than using a step-size of 1 /Lij,, as in this case we have 


f{x^+^) = min{/(x*^ -b aeij} 

Oi 

< f (x^ - (14) 

which is equivalent to However, in practice using exact coordinate optimization leads to better perfor¬ 
mance. In this section, we show that using the GS rule results in a convergence rate that is indeed faster 
than © for problems with distinct Li when the function is quadratic, or when the function is not quadratic 
but we perform exact coordinate optimization. 

The key property we use is that, after we have performed exact coordinate optimization, we are guaranteed 
to have \7i^f{x^~^^) = 0. Because the GS rule chooses ik+i = argmaxj \\7if{x^~^^)\, we cannot have ik+i = ik, 
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unless is the optimal solution. Hence, we never choose the same coordinate twice in a row, which 


guarantees that the inequality (13) is strict (with distinct Li) and exact coordinate optimization is faster. 
We note that the improvement may be marginal, as we may simply alternate between the two largest Li 
values. However, consider minimizing /12 when the graph is sparse; after updating we are guaranteed to 
have = 0 for all future iterations (k + m) until we choose a variable ik+m-i that is a neighbour 

of node ik in the graph. Thus, if the two largest Li are not connected in the graph, GS cannot simply 
alternate between the two largest Li. 

By using this property, in Appendix]^ we show that the GS rule with exact coordinate optimization for 
problem h 2 under a chain-structured graph has a convergence rate of the form 


- f{x*) < O (max{p^,pf}*^) [f{x°) - fix*)], 


where is the maximizer of -^(1 — fii/Li))! — ^i/Lj) among all consecutive nodes i and j in the chain, 
and is the maximizer of ^(1 — pi/Li)(l — /ii/Lj)(l — pi/Lfe) among consecutive nodes i, j, and k. The 
implication of this result is that, if the large Li values are more than two edges from each other in the graph, 
then we obtain a much better convergence rate. We conjecture that for general graphs, we can obtain a 
bound that depends on the largest value of p^ among all nodes i and j connected by a path of length 1 or 2. 
Note that we can obtain similar results for problem hi, by forming a graph that has an edge between nodes 
i and j whenever the corresponding variables are both jointly non-zero in at least one row of A. 


6 Rules Depending on Lipschitz Constants 


If the Li are known, |Nestero\^ [2012] showed that we can obtain a faster convergence rate by sampling 
proportional to the Li. We review this result below and compare it to the GS rule, and then propose an 
improved GS rule for this scenario. Although in this section we will assume that the Li are known, this 


assumption can be relaxed using a backtracking procedure Nesterov 2012, §6.1]. 


6.1 Lipschitz Sampling 

Taking the expectation of © under the distribution pi = Li j proceeding as before, we obtain 


E[/(x'=+i)] - fix*) < (1 - [fi^’^) - fn 


where L = ^ average of the Lipschitz constants. This was shown by Leventhal and Lewis 

2010] and is a special case of Nesterov 2012 Theorem 2] with a = 1 in his notation. This rate is taster 
than ([^ for uniform sampling if any Li differ. 

Under our analysis, this rate may or may not be faster than (|^ for the GS rule. On the one extreme, if 
Pi = p/n and any Li differ, then this Lipschitz sampling scheme is faster than our rate for GS. Indeed, in 
the context of the problem from Section [4T] we can make Lipschitz sampling faster than GS by a factor of 
nearly n by making one Xi much larger than all the others (recall that our analysis shows no benefit to the 
GS rule over randomized selection when only one Xi is much larger than the others). At the other extreme. 


in our example from Section 4.1 with many large a and one small /3, the GS and Lipschitz sampling rates 
are the same when n = 2, with a rate of (1 — /3/(a -f /?)). However, the GS rate will be faster than the 
Lipschitz sampling rate for any a > /3 when n > 2, as the Lipschitz sampling rate is (1 — /3/((n — l)a-|-/3)), 
which is slower than the GS rate of (1 — /3/(a -I- (n — l)/3)). 


6.2 Gauss-Southwell-Lipschitz Rule 

Since neither Lipschitz sampling nor GS dominates the other in general, we are motivated to consider if 
faster rules are possible by combining the two approaches. Indeed, we obtain a faster rate by choosing the 
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ik that minimizes leading to the rule 


|VJ(x"')l 

Ik = argmax--=—, 

I V 

which we call the Gauss-Southwell-Lipschitz (GSL) rule. Following a similar argument to Section but 
using (11) in place of ([^, the GSL rule obtains a convergence rate of 

f{x’‘+^) - f{x*) < (1 - HL)[f{x'') - f{x*)], 


where /tl is the strong-convexity constant with respect to the norm 
Appendix [E} and in Appendix [F] we show that 


\l = Sr=i V^kil- This is shown in 


f M Ml 


'} < ML 


< 


Ml 


mini!Li} ’ 


Thus, the GSL rule is always at least as fast as the fastest of the GS rule and Lipschitz sampling. Indeed, 
it can be more than a factor of n faster than using Lipschitz sampling, while it can obtain a rate closer to 
the minimum Li, instead of the maximum Li that the classic GS rule depends on. 

An interesting property of the GSL rule for quadratic functions is that it is the optimal myopic coordinate 
update. That is, if we have an oracle that can choose the coordinate and the step-size that decreases / by 
the largest amount, i.e., 

= argmin{/(a;'' + aei)}, (15) 


this is equivalent to using the GSL rule and the update in (10). This follows because © holds with equality 
in the quadratic case, and the choice ak = 1/Li^ yields the optimal step-size. Thus, although faster schemes 
could be possible with non-myopic strategies that cleverly choose the sequence of coordinates or step-sizes, 
if we can only perform one iteration, then the GSL rule cannot be improved. 


For general /, (15) is known as the maximum improvement (MI) rule. This rule has been used in the con¬ 
text of boosting Ratsch et al. 2001 , graphical models Della Pietra et al. 1997 Lee et al. 2006, Scheinberg 
and Rish 2009 , Gaussian processes Bo and Sminchisescu |2008|, and low-rank tensor approximations |Li 


et al.||20i5 . Using an argument similar to (|14[), our GSL rate also applies to the MI rule, improving existing 


bounds on this strategy. However, the GSL rule is much cheaper and does not require any special structure 
(recall that we can estimate L^ as we go). 


6.3 Connection between GSL Rule and Normalized Nearest Neighbour Search 


Dhillon et al. 2011 discuss an interesting connection between the GS rule and the nearest-neighbour-search 


(NNS) problem for objectives of the form 


min F{x) = f{Ax), (16) 

This is a special case of hi with no gi functions, and its gradient has the special form 

VF{x) = A^r(x), 

where r{x) = Wf{Ax). We use the symbol r because it is the residual vector {Ax — b) in the special case of 
least squares. For this problem structure the GS rule has the form 

ik = argmax I Vi/(a:'') I 

i 

= argmax |r(a;^)^ai|, 
















































where ai denotes column i oi A for i = 1,... ,n. 
argmax by solving the following NNS problem 


Dhillon et al. 2011 propose to approximate the above 


Ik = argmm | 

ie[2n] 




where i in the range (n + 1 ) through 2n refers to the negation —(ui-n) of column (i — n) and if the selected 
ik is greater than n we return (i — n). We can justify this approximation using the logic 

ik = argmin||r(a;'') - ai\\ 

iG[2n] 

= argmin i||r(a;'') - Oilp 

ie[2n] ^ 

= argmin -r(x'" fai + ^||aif 

iG[2n] A ^ 

constant 

= argmaxr(a;'')^ai - ^||ai|p 

ie[2n] ^ 

= argmax |r(a;'")'^ajI - 

iG [n] ^ 

= argmax|V*/(a;'')| - ^|lai|p. 

iG [n] ^ 


Thus, the NNS computes an approximation to the GS rule that is biased towards coordinates where |jai|| 
is small. Note that this formulation is equivalent to the GS rule in the special case that ||ai|| = 1 (or any 
other constant) for all i. Shrivastava and Li 2014 have more recently considered the case where ||ai|| < 1 
and incorporate powers of ||ai|| in the NNS to yield a better approximation. 

A further interesting property of the GSL rule is that we can often formulate the exact GSL rule as a 
normalized NNS problem. In particular, for problem (16 1 the Lipschitz constants will often have the form 
Li = 7 ||ai|P for a some positive scalar 7 . For example, least squares has 7=1 and logistic regression has 
7 = 0.25. When the Lipschitz constants have this form, we can compute the exact GSL rule by solving a 
normalized NNS problem. 


Ik = argmm 

iG[2n] 


ix>^) - 


(17) 


The exactness of this formula follows because 


ik 


argmin 

iG[2n] 


r{x'^) 



argmin iIIr(a;'') - ai/\\ai 

iG[2n] ^ 



constant 


argmax 

iG[n] 

argmax 

iG[n] 

argmax 

iG[ra] 


|r(a;'")'^a^| 

|r(:r'-)^a.| 



constant 
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Thus, the form of the Lipschitz constant conveniently removes the bias towards smaller values of |jai|| that 
gets introduced when we try to formulate the classic GS rule as a NNS problem. Interestingly, in this setting 
we do not need to know 7 to implement the GSL rule as a NNS problem. 


7 Approximate Gauss-Southwell 


In many applications, computing the exact GS rule is too inefficient to be of any practical use. However, a 
computationally cheaper approximate GS rule might be available. Approximate GS rules under multiplicative 
and additive errors were considered by Dhillon et al. 2011 in the convex case, but in this setting the 
convergence rate is similar to the rate achieved by random selection. In this section, we give rates depending 
on pLi for approximate GS rules. 

7.1 Multiplicative Errors 

In the multiplicative error regime, the approximate GS rule chooses an satisfying 

|V,,/(x'=)|>||V/(cr^)|U(l-efc), 

for some Ck G [0,1). In this regime, our basic bound on the progress ([^ still holds, as it was defined for any 
ik- We can incorporate this type of error into our lower bound Q to obtain 

/(cr*)>/(x'=)-^||V/(x'=)||^ 


2 /ii 


> - 


2^1(1 - efc )2 


This implies a convergence rate of 


/(x'=+^) - f{x*) < 1 - 


Mi(l - f-kf 






Thus, the convergence rate of the method is nearly identical to using the exact GS rule for small €k (and it 
degrades gracefully with efc). This is in contrast to having an error in the gradient Friedlander and Schmidt [ 
20 I 2 |, where the error e must decrease to zero over time. 


7.2 Additive Errors 

In the additive error regime, the approximate GS rule chooses an ik satisfying 

|V.,/(x'=)|>|lV/(x'=)||oo-efe, 

for some Ck > 0. In Appendix we show that under this rule, we have 

fix'") - f{x*) < (1 - [f{x°) - fix*) + Ak] , 


where 


Ak < min (1 - 

li=l i=l 

where Li is the Lipschitz constant of V/ with respect to the I-norm. Note that Li could be substantially 
larger than L, so the second part of the maximum in Ak is likely to be the smaller part unless the et are 
large. This regime is closer to the case of having an error in the gradient, as to obtain convergence the ej. 


1 - 




^Vfix°)-fix*) + ^ 
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must decrease to zero. This result implies that a sufficient condition for the algorithm to obtain a linear 
convergence rate is that the errors €k converge to zero at a linear rate. Further, if the errors satisfy €k = 0{p^) 
for some p < (1 — pi/L), then the convergence rate of the method is the same as if we used an exact GS 
rule. On the other hand, if does not decrease to zero, we may end up repeatedly updating the same wrong 
coordinate and the algorithm will not converge (though we could switch to the randomized method if this is 
detected). 

8 Proximal-Gradient Gauss-Southwell 

One of the key motivations for the resurgence of interest in coordinate descent methods is their performance 
on problems of the form 

n 

2=1 

where / is smooth and convex and the gi are convex, but possibly non-smooth. This includes problems with 
fi-regularization, and optimization with lower and/or upper bounds on the variables. Similar to proximal- 
gradient methods, we can apply the proximal operator to the coordinate update. 


= proxi„ 

L y^k 


where 


Prox^gjy] = argmin -||a; - yf + agi{x). 


With random coordinate selection, Richtarik and Takac 2014] show that this method has a convergence rate 
of 


E[i^(a;'=+i) - F{x*)] [F{x'^) - F{x*)], 

V nL/ 


similar to the unconstrained/smooth case. 

There are several generalizations of the GS rule to this scenario. Here we consider three possibilities, all 
of which are equivalent to the GS rule if the gi are not present. First, the GS-s rule chooses the coordinate 


with the most negative directional derivative. This strategy is popular for £i-regularization Shevade and 
Keerthi, 2003 Wu and Lange 2008 Li and Osher] 2009 and in general is given by [see Bertsekas 1999 §8.4] 


ik = argmax { min |Vi/(a;^) -I- s| 

i is^dgt 

However, the length of the step — a:^||) could be arbitrarily small under this choice. In contrast, the 


2011 


GS-r rule chooses the coordinate that maximizes the length of the step Tseng and Yun, 2009 Dhillon et al. 

I 


Ik = argmax 


Xi -proxi 




This rule is effective for bound-constrained problems, but it ignores the change in the non-smoot h term 
{9i{^i^^) ~9ii^k))- Finally, the GS-q rule maximizes progress assuming a quadratic upper bound on / Tseng 
and Yun] 2009] , 


ik = argmin | min {/(x'") -b Vif{x'")d + + gi{x^ + d) - gi{Xi)} 

i d Z 


While the least intuitive rule, the GS-g rule seems to have the best theoretical properties. Further, if we use 
Li in place of L in the GS-q rule (which we call the GSL-g strategy), then we obtain the GSL rule if the gi 
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are not present. In contrast, using Li in place of L in the GS-r rule (which we call the GSL-r strategy) does 
not yield the GSL rule as a special case. 

In Appendix]^ we show that using the GS-q rule yields a convergence rate of 

- F{x*) < min | (l “ (l - [f{x^) - f{x*)] + Cfej, 

where is bounded above by a measure of the non-linearity of the gi along the possible coordinate updates 
times the inverse condition number gi/L. Note that goes to zero as k increases and we conjecture that the 
above bound actually holds with = 0. In contrast, in Appendix|^we also give counter-examples showing 
that the above rate does not hold with = 0 for the GS-s or GS-r rule, even if you replace the minimum 
by a maximum. Thus, any bound for the GS-s or GS-r rule would be slower than the expected rate under 
random selection, while the GS-q rule leads to a better bound. 


9 Experiments 


We first compare the efficacy of different coordinate selection rules on the following simple instances of hi. 
^2-regularized sparse least squares: Here we consider the problem 

mm^llAx- 6f -f ^||a;f, 

an instance of problem hi. We set A to be an m by n matrix with entries sampled from a A/'(0,1) distribution 
(with m = 1000 and n = 1000). We then added 1 to each entry (to induce a dependency between columns), 
multiplied each column by a sample from Af{0, 1) multiplied by ten (to induce different Lipschitz constants 
across the coordinates), and only kept each entry of A non-zero with probability 101og(n)/n (a sparsity level 
that allows the Gauss-Southwell rule to be applied with cost 0(log^(n)). We set A = 1 and b = Ax -I- e, 
where the entries of x and e were drawn from a A/'(0,1) distribution. In this setting, we used a step-size of 
1 /Li for each coordinate i, which corresponds to exact coordinate optimization. 

£2-regularized sparse logistic regression: Here we consider the problem 

1 " A 

min-Vlog(l -f exp(-&iafa;)) -b -||xf. 

X n 2 


We set the aj to be the rows of A from the previous problem, and set b = sign(Aa;), but randomly flipping 
each bi with probability 0.1. In this setting, we compared using a step-size of 1/Li to using exact coordinate 
optimization. 

Over-determined dense least squares: Here we consider the problem 

min;^||Aa;-6f, 

X In 


but, unlike the previous case, we do not set elements of A to zero and we make A have dimension 1000 by 100. 
Because the system is over-determined, it does not need an explicit strongly-convex regularizer to induce 
global strong-convexity. In this case, the density level means that the exact GS rule is not efficient. Hence, 


we use a balltree structure |Omohundro 

f— ‘ 

CD 

00 

CO 

to implement an efficient approximate GS rufe based on the 

connection to the NNS problem discovered by 

Dhillon et al. 

2011 . On the other hand, we can compute the 


exact GSL rule for this problem as a NNS problem as discussed in Section 6.3 


i-regularized underdetermined sparse least squares: Here we consider the non-smooth problem 


min — II Ax 
X 2n 


6f+ A||x||i. 


We generate A as we did for the £ 2 -regularized sparse least squares problem, except with the dimension 1000 
by 10000. This problem is not globally strongly-convex, but will be strongly-convex along the dimensions 
that are non-zero in the optimal solution. 
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Objective Objective 



Figure 1: Comparison of coordinate selection rules for 4 instances of problem hi. 
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Graph-based label propagation 



Epochs 


Figure 2: Comparison of coordinate selection rules for graph-based semi-supervised learning. 


We plot the objective function (divided by its initial value) of coordinate descent under different selection 
rules in Figure Even on these simple datasets, we see dramatic differences in performance between the 
different strategies. In particular, the GS rule outperforms random coordinate selection (as well as cyclic 
selection) by a substantial margin in all cases. The Lipschitz sampling strategy can narrow this gap, but 
it remains large (even when an approximate GS rule is used). The difference between GS and randomized 
selection seems to be most dramatic for the -regularized problem; the GS rules tend to focus on the non¬ 
zero variables while most randomized/cyclic updates focus on the zero variables, which tend not to move 
away from zeroj^ Exact coordinate optimization and using the GSL rule seem to give modest but consistent 
improvements. The three non-smooth GS-* rules had nearly identical performance despite their different 
theoretical properties. The GSL-g rule gave better performance than the GS-* rules, while the the GSL-r 
variant performed worse than even cyclic and random strategies. We found it was also possible to make the 
GS-s rule perform poorly by perturbing the initialization away from zero. While these experiments plot the 
performance in terms of the number of iterations, in Appendix [l] we show that the GS-* rules can also be 
advantageous in terms of runtime. 

We next consider an instance of problem ^ 2 , performing label propagation for semi-supervised learning 
in the ‘two moons’ dataset Zhou et al., 2004 . We generate 500 samples from this dataset, randomly label 


five points in the data, and connect each node to its five nearest neighbours. This high level of sparsity is 
typical of graph-based methods for semi-supervised learning, and allows the exact Gauss-Southwell rule to 
be implemented efficiently. We use the quadratic labeling criterion of Bengio et al. 2006 , which allows exact 


coordinate optimization and is normally optimized with cyclic coordinate descent. We plot the performance 
under different selection rules in Figure Here, we see that even cyclic coordinate descent outperforms 
randomized coordinate descent, but that the GS and GSL rules give even better performance. We note that 
the GS and GSL rules perform similarly on this problem since the Lipschitz constants do not vary much. 


10 Discussion 

It is clear that the GS rule is not practical for every problem where randomized methods are applicable. 
Nevertheless, we have shown that even approximate GS rules can obtain better convergence rate bounds than 
fully-randomized methods. We have given a similar justification for the use of exact coordinate optimization, 

®To reduce the cost of the GS-s method in this context, [Shevade and Keerthi|[2003] consider a variant where we first compute 
the GS-s rule for the non-zero variables and if an element is sufficiently large then they do not consider the zero variables. 
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and we note that our argument could also be used to justify the use of exact coordinate optimization within 
randomized coordinate descent methods (as used in our experiments). We have also proposed the improved 
GSL rule, and considered approximate/proximal variants. We expect our analysis also applies to block 


updates by using mixed norms || • and could be used for accelerated/parallel methods Fercoq and 

Richtarik 2013 , for primal-dual rates of dual coordinate ascent [Shalev-Shwartz and Zhang 

2013, for 

successive projection methods Leventhal and Lewis 2010 , for boosting algorithms [Ratsch et al. 

2001 , and 

for scenarios without strong-convexity under general error bounds ILuo and Tseng 

1993 . 
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Appendix A Efficient calculation of GS rules for sparse problems 

We first give additional details on how to calculate the GS rule efficiently for sparse instances of problems 
hi and /12. We will consider the case where each gi is smooth, but the ideas can be extended to allow a 
non-smooth gi. Further, note that the efficient calculation does not rely on convexity, so these strategies can 
also be used for non-convex problems. 

A.l Problem h 2 

Problem /i2 has the form 

h2{x) ■.= ^g,{xi) + ^ f,j{x„Xj), 

iGV {i,j)^E 

where each gi and fij are differentiable and G = {V, E} is a graph where the number of vertices \V\ is the 
same as the number of variables n. If all nodes in the graph have a degree (number of neighbours) bounded 
above by some constant d, we can implement the GS rule in O(dlogn) after an 0{n+\E\) time initialization 
by maintaining the following information about x^: 

1. A vector containing the values \7igi{Xi). 

2. A matrix containing the values Wifij{Xi,x^) in the first column and W j fij {Xi, x^) in the second column. 


Given the heap structure, we can compute the GS rule in 0(1) by simply reading the index value of the root 
node in the max heap. The costs for initializing these structures are: 

1 . 0{n) to compute gi{xi) for all n nodes. 

2. Oi\E\) to compute Vij fij(Xi, Xj) for all \E\ edges. 

3. 0{n + |A|) to sum the values in the above structures to compute \7h{x°), and 0{n) to construct the 
initial max heap. 

Thus, the one-time initialization cost is 0{n + lAl). The costs of updating the data structures after we 
update Xi^ to Xi^^ for the selected coordinate ik are: 

1 . 0(1) to compute 

2. 0{d) to compute '^ijfijixi'^^,x^'^^) for {i,j) S E and i = ik or j = ik (only d such values exist by 
assumption, and all other Vijfij{xi,Xj) are unchanged). 


3. The elements of the gradient vector \/h 2 {x^) stored in a binary max heap data structure [see Cormen 


et ah, 2001 Chapter 6]. 
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3. 0{d) to update up to d elements of that differ from \7h{x^) by using differences in changed 

values of gi and fij, followed by 0{d\ogn) to perform d updates of the heap at a cost of O(logn) for 
each update. 

The most expensive part of the update is modifying the heap, and thus the total cost is 0(dlogn)j^ 

A.2 Problem hi 

Problem hi has the form 

n 

:= ^ g,{xi) + f{Ax), 

i=l 

where gt and / are differentiable, and is an m by n matrix where we denote column i by ai and row j by 
aj. Note that / is a function from IR™ to IR, and we assume only depends on ajx. While this is a 
strong assumption (e.g., it rules out / being the product function), this class includes a variety of notable 
problems like the least squares and logistic regression models from our experiments. If A has 2; non-zero 
elements, with a maximum of c non-zero elements in each column and r non-zero elements in each row, then 
with a pre-processing cost of 0(z) we can implement the GS rule in this setting in 0(cr log n) by maintaining 
the following information about x^: 

1. A vector containing the values \/igi{xf). 

2. A vector containing the product Ax'^. 

3. A vector containing the values Vf{Ax^). 

4. A vector containing the product A^V/(Aa;^). 

5. The elements of the gradient vector \7hi{x^) stored in a binary max heap data structure. 

The heap structure again allows us to compute the GS rule in 0(1), and the costs of initializing these 
structures are: 

1. 0(n) to compute gi{x°) for all n variables. 

2 . 0{z) to compute the product Ax'^. 

3. 0(m) to compute \7f{Ax^) (using that Vj/ only depends on ajx^). 

4. 0{z) to compute A^Vf{Ax^). 

5. 0(n) to add the Vi5i(a:°) to the above product to obtain V/ii(a;°) and construct the initial max heap. 

As it is reasonable to assume that z > m and z > n (e.g., we have at least one non-zero in each row and 
column), the cost of the initialization is thus 0(z). The costs of updating the data structures after we update 
xf, to for the selected coordinate ik are: 

•'k 

1. 0(1) to compute gt^ix’^^^). 

2. 0(c) to update the product using = Ax*^ -I- (x^+^ — xf^)ai, since has at most c non-zero 

values. 

3. 0(c) to update up to c elements of V/(Ax^+^) that have changed (again using that Vjf only depends 
on ajx'"'+^). 

^For less-sparse problems where n < dlogn, using a heap is actually inefficient and we should simply store as a 

vector. The initialization cost is the same, but we can then perform the GS rule in 0{n) by simply searching through the vector 
for the maximum element. 
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4. 0(cr) to perform up to c updates of the form f{Ax’^) + iyjf{Ax^^'^) — 

jf{Axy{ai)^, where each update costs 0{r) since each m has at most r non-zero values. 

5. 0(cr log n) to update the gradients in the heap. 

The most expensive part is again the heap update, and thus the total cost is 0(cr log n). 


Appendix B Relationship between /ii and /i 

We can establish the relationship between /i and by using the known relationship between the 2-norm 
and the 1-norm, 

||a:||i>||x||>^||x||i. 

\/n 

In particular, if we assume that / is /r-strongly convex in the 2-norm, then for all x and y we have 

f{y) >f{x) + yf{x),y-x) + ^\\y-xf 

> f{x) -I- y f{x),y- x) + ^\\y- 

implying that / is at least ^-strongly convex in the 1-norm. Similarly, if we assume that a given / is 
/ri-strongly convex in the 1-norm then for all x and y we have 

f[y) > f{x) + y f(x),y-x) + ^\\y-x\\l 

> f{x) + yf{x),y- x) + y b - , 

implying that / is at least //i-strongly convex in the 2-norm. Summarizing these two relationships, we have 

- < /^i < 

n 


Appendix C Analysis for separable quadratic case 


We first establish an equivalent definition of strong-convexity in the 1-norm, along the lines of |Nesterov| 


2004 Theorem 2.1.9]. Subsequently, we use this equivalent definition to derive fii for a separable quadratic 


function. 


C.l Equivalent definition of strong-convexity 

Assume that / is /ri-strongly convex in the 1-norm, so that for any x,y G K" we have 

fiy) > fix) + yfix),y-x) + y||2/-a;||i. 

Reversing x and y in the above gives 

fix) > fiy) + yfiy),X - y) + ^Wx - y\\l, 

and adding these two together yields 

(V/( 2 /) - V/(a:),y-x) > fii\\y - xWf. (18) 

Conversely, assume that for all x and y we have 

y fiy) fix),y - x) > y.i\\y-x\\l, 
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and consider the function ^(t) = f{x + T{y — x)) for r G ]R. Then 


/(y) - fix) - (V/(a;), y-x) = g{l) - g(0) - (Vf(x), y - x) 

dg 


= ^ix)- (^fix),y-x) dr 

= / fix + riy - x)),y - x) - {Wfix),y - x) dr 

= [ {'^fix + Tiy-x))-\ 7 f{x),y-x) dr 
Jo 

> [ —\\xiy-x)\\l dr 

Jo X 

= / fJ'ix\\y-x\\l dr 
Jo 


= ylll/-*li;. 

Thus, ^i-strong convexity in the 1-norm is equivalent to having 

{'^fiy)-'^fix),y-x)>fii\\y-x\\l 'J x,y. (19) 

C.2 Strong-convexity constant fii for separable quadratic functions 

Consider a strongly convex quadratic function / with a diagonal Hessian H = V^/(x) = diag(Ai,..., A„), 
where Xi > 0 for alH = 1,..., n. We show that in this case 


Ml 


HZ 




-1 


From the previous section, gi is the minimum value such that (19) holds. 


= inf 

x^V 


(V/(y)-V/(a;),y-a;) 


\\y-x\\\ 

Using V/(a;) = Hx -\- b for some b and letting z = y — x, we get 

{{Hy -b)- (Hx -b),y-x) 


fii — inf 

x^y 


= inf 


\\y-x\\t 

{H{y - x),y - x) 


x=v lly-a;||f 

_ ■ 

-1% ||z||? 

= min z'^Hz 
lhlli=i 

n 

= min } Xizf, 

i=l 

where the last two lines use that the objective is invariant to scaling of z and to the sign of z (respectively), 
and where e is a vector containing a one in every position. This is an equality-constrained strictly-convex 
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quadratic program, so its solution is given as a stationary point {z* ,r]*) of the Lagrangian, 




Differentiating with respect to each Zi for i = 1,... ,n and equating to zero, we have for all i that 2\iZ* — nj* = 
0 , or 


2A,' 


( 20 ) 


Differentiating the Lagrangian with respect to rj and equating to zero we obtain \ — e^z* = 0, or equivalently 


which yields 


„-=2 ^ 


Combining this result for rj* with equation (20), we have 




A, 


A, 


This gives the minimizer, so we evaluate the objective at this point to obtain /ri, 

n 

Ml = 


2=1 


-IN 


0 = 1 ^ 


2=1 




, Ai , . , 

2=1 \l = l 


HE^) (Ei 




= Ev 


-1 


Appendix D Gauss-Southwell with exact optimization 

We can obtain a faster convergence for GS using exact coordinate optimization for sparse variants of problems 
hi and ^ 2 , by observing that the convergence rate can be expressed in terms of the sequence of (1 — ^i/Li^,) 
values, 

k , 

n ‘-IT 

i=i ^ ^ 




[/(x°)-/(x*)]. 
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The worst case occurs when the product of the (1 — values is as large as possible. However, using 

exact coordinate optimization guarantees that, after we have updated coordinate i, the GS rule will never 
select it again until one of its neighbours has been selected. Thus, we can obtain a tighter bound on the 
worst-case convergence rate using GS with exact coordinate optimization on iteration k, by solving the 
following combinatorial optimization problem defined on a weighted graph: 

Problem 1 . We are given a graph G = {V,E) with n nodes, a number Mi associated with each node i, and 
an iteration number k. Choose a sequence that maximizes the sum of the Mi_,, subjeet to the following 

constraint: after eaeh time node i has been chosen, it cannot be chosen again until after a neighbour of node 
i has been ehosen. 

We can use the Mi chosen by this problem to obtain an upper-bound on the sequence of log(l —values, 
and if the largest Mi values are not close to each other in the graph, then this rate can be much faster than 
the rate obtained by alternating between the largest Mi values. In the particular case of chain-structured 
graphs, a worst-case sequence can be constructed that spends all but 0{n) iterations in one of two solution 
modes: (i) alternate between two nodes i and j that are connected by an edge with the highest value of 
or (ii) alternate between three nodes {i,j,k} with the highest value of where there is 

an edge from i to j and from j to k, but not from i to k. To show that these are the two solution modes, 
observe that the solution must eventually cycle because there are a finite number of nodes. If you have more 
than three nodes in the cycle, then you can always remove one node from the cycle to obtain a better average 
weight for the cycle without violating the constraint. We will fall into mode (i) if the average of Mi and Mj 
in this mode is larger than the average of Mi, Mj and Mk in the second mode. We can construct a solution 
to this problem that consists of a ‘burn-in’ period, where we choose the largest Mi, followed by repeatedly 
going through the better of the two solution modes up until the final three steps, where a ‘burn-out’ phase 
arranges to finish with several large Mi. By setting Mi = log(l — pii/Li), this leads to a convergence rate of 
the form 

fix^) - fix*) < O (max{pf,pf}'‘’) [f{x°) - fix*)], 

where is the maximizer of ■\/(l — /xi/Li)(l — pi/Lj) among all consecutive nodes i and j in the chain, 
and is the maximizer of ^{1 — pi/Li){l — p,i/Lj){l — pi/Lk) among consecutive nodes i, j, and k. The 
0{) notation gives the constant due to choosing higher (1 — pi/Li) values during the burn-in and burn-out 
periods. The implication of this result is that, if the large Li values are more than two edges away from each 
other in the graph, then the convergence rate can be much faster. 


Appendix E Gauss-Southwell-Lipschitz rule: convergence rate 

The coordinate-descent method with a constant step-size of Li^ uses the iteration 






Because / is coordinate-wise Lij,-Lipschitz continuous, we obtain the following bound on the progress made 
by each iteration: 


/(x'=+i) < fix>^) -f VUix’^)ix’*^" - - x^) 


k\2 




1 




fcM 2 


= fixn-^[^Uixn] 


(21) 


= fix'^) - I 


^^Jix^ 
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By choosing the coordinate to update according to the Gauss-Southwell-Lipchitz (GSL) rule, 


Zfc = argmax 

i 




we obtain the tightest possible bound on ([2l]). We dehne the following norm, 




(22) 


which has a dual norm of 


\x\\ r = max 


* \/Ll 


\Xi\. 


Under this notation, and using the GSL rule, (21) becomes 


Measuring strong-convexity in the norm |j • \\l we get 


f{y) > f{x) + {Vf{x),y-x) + ^h-xWi. 


Minimizing both sides with respect to y we get 

f{x*) > f{x) - snp{{-\7f{x),y - x) - ^\\y - x\\l} 
y ^ 

=f{x)-[^\\-\\iy{-vf{x)) 

= f{x)-y^{\\Vfix)\\lf. 

Putting these together yields 


/(a^'^+i) - fix*) < (1 - yL)[fix'^) - fix*)]. 


(23) 


Appendix F Comparing /i^ to fii and /i 


By the logic Appendix to establish a relationship between different strong-convexity constants under 
different norms, it is sufficient to establish the relationships between the squared norms. In this section, we 
use this to establish the relationship between defined in (22) and both and y. 


F.l Relationship between and /xi 

We have 

c||a;||i - l|a;||L = \x^\ - XI V^kil = V%)\x^\, 

i i i 

Assuming c > -s/Z, where L = maxi{Li}, the expression is non-negative and we get 

\\x\\l < y^\\x\\i. 


By using 

c\\x\\l - Iklli = X(c\/^- l)|a^*|, 

i 
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and assuming c > , where = min^lLi}, this expression is nonnegative and we get 

V ^min 

\\x\\,<^^\\xU. 

V ^min 

The relationship between and fXi is based on the squared norm, so in summary we have 

Ml ^ ^ 

T ~ ~T 

^ ^min 

F.2 Relationship between iii and /t 

Let L denote a vector with elements \/I^, and we note that 


= E(^)' 


1/2 


1/2 


E Li = \/riL, where L = — Li. 

/ 71 


Using this, we have 


This implies that 


\\x\\l = a;^(sign(x) o L) < ||a;|||| sign(a:) o L\\ = v^nL||a;||. 


nL 


< Ml- 


Note that we can also show that ul ^ , but this is less tight than the upper bound from the previous 

■^min 

section because < /i. 


Appendix G Approximate Gauss-Southwell with additive error 

In the additive error regime, the approximate Gauss-Southwell rule chooses an ik satisfying 

> ||V/(a;'')||oo - Cfc, where Cfc > 0 Wk, 

and we note that we can assume < |lV/(x^)||oo without loss of generality because we must always choose 
an i with |Vij,/(x^)| > 0. Applying this to our bound on the iteration progress, we get 


< /(^") 


^{\\y.nx'^)\\^-e,y 
^(llV/(a;'=)||L-2efe||V/(a:'=)|U + eI) 
^l|V/(:r'=)||L + |liV/(x'=)||oo-|- 


(24) 


We first give a result that assumes / is Li-Lipschitz continuous in the 1-norm. This implies an inequality 
that we prove next, followed by a convergence rate that depends on Li. However, note that L < Li < Ln, 
so this potentially introduces a dependency on n. We subsequently give a slightly less concise result that 
has a worse dependency on e but does not rely on Li. 
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G.l Gradient bound in terms of Li 

We say that V/ is Li-Lipschitz continuous in the l-norm if we have for all x and y that 


||V/(a;) - V/(j/)||oo < Li\\x - y\\x. 


Similar to Nesterov 2004 Theorem 2.1.5], we now show that this implies 


f{y) > fix) + (v/(x),2/-x) + ^ll^/(y) - '^fix) 


1 ^, 


(25) 


and subsequently that 

||V/(x'=)||oo = |jv/(cr'=) - V/(a:*)||oo < ^2L^ifix>^)-f{x*)) < ^2L^{f{x^)-f{x*)), (26) 

where we have used that /(x*) < /(x^“^) for all k and any choice of ik-i (this follows from the basic bound 
on the progress of coordinate descent methods). 

We first show that V/ being Li-Lipschitz continuous in the 1-norm implies that 

fiy) < fix) + (V/(x),?/-x) -h Y^ly-xWj, 
for all X and y. Consider the function gir) = fix + T{y — x)) with t € IR. Then 


fiy) - fix) - (Vfix), y-x) = gil) - g(0) - (V/(x), y - x) 

dg 


= ^("t) - (V/(x),y-x) dr 

= / (V/(x-h r(y - x)),y - x) - (V/(x),y - x) dr 
^0 

= [ (V/(x-hr(j/-x)) - V/(x),?/-x) dr 
Jo 

< [ l|V/(x-f r(?/-x))-V/(x)||oo||y-x||i dr 
Jo 

< [ -^iTlId-a^lli dr 
Jo 




To subsequently show (251, fix x G M" and consider the function 


Hy) = fiy) - (v/(x),?/), 


which is convex on IR" and also has an Li-Lipschitz continuous gradient in the 1-norm, as 


Wiy) - </>'(a:)lloo = ||(V/(J/) - V/(x)) - (V/(x) - V/(x))|| 
= l|V/(?/)-V/(x)||oo 
< Li\\y-x\\i. 
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As the minimizer of </> is x (i.e., (j )'= 0)) for any y G IR” we have 

(j){x) = min(;i(7;) < mm(l){y) + {(j)'(y), v - y) + ^\\v - y\\j 

V V Z 

= Hy) - sup{-(j)'{y),v - y) - ^\\v - y\\l 

= Hy)-^W{y)\\l- 


Substituting in the definition of (j), we have 


fix) - {Wfix),x) < fiy) - {Wfix),y) - ^l|V/(y) - V/(cr)||L 

fix) < fiy) + iVfix),x-y) - ^||V/(?/) - V/(a;)||2 

fiy) > fix) + i\7fix),y- x) + ^l|V/(y) - "^fix)\\l 


G.2 Additive error bound in terms of Li 


Using (26) in (24) and noting that > 0, we obtain 

fix^+^) < fix'^) - ^\\yfix’^)\L + |l|V/(x'=) 


2L 


< fix'^) - ^l!V/(cr'=)||L + ^j-^2L,ifixO)-fix*))-^ 

< fix'^) - ^W^fix^nl + ek^Vfix°)-fix*)- 


Applying strong convexity (taken with respect to the 1-norm), we get 

fix^+^) - fix*) < (l - f) [/(^'') - + ^k'^Vfix°)-fix*), 


which implies 


fix^+^)-fix*)< 1- 


= 1 - 


tfl 

L 

tfi 

L 






k r 


where 


fix°) - fix*) + \/fix°) - fix*)Ak 


Ah = 


L 




i=l 


L 


G.3 Additive error bound in terms of L 

By our additive error inequality, we have 

|V,,/(cr'=)|+efc> ||V/(x'=)|| 


fix*) 
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Using this again in (241 we get 


< /(x'=) - ^||V/(a:'=)||^ + |||V/(x'=) 

„feM|2 I , 


2L 


< fi^") - ^l!v/(a.'=)ii^ + j{\y^jix'^)\ + - H 

= - ^l|V/(a;'=)||^ + ||V,,/(x'=)| + ||. 

Further, from our basic progress bound that holds for any ik we have 

1 2 


fix*) < /(^'=+i) < fix^) - F 


^^Jixl 




n 2 


Vz./(a 


which implies 


and thus that 


\W,Jix>^)\ < ^2Lifix^)-fix*)). 


fix’^^^) < fix^) - ^\\^fix'^)\\l + |y2L(/(xO)-/(x*)) + || 


= fix^)-^\\^fix 


k\\\2 
oo 


+ ek\I^Vfix°)-fix*) + ^. 


Applying strong convexity and applying the inequality recursively we obtain 

k k / \ k—i 


fix^+^) - fix*) < ( 1 - ^ 


[fix°)-fix*)]+Y2(^l-(^^i]l^Vfix°) - fix*) + 


= 1 - 


Mi 


fix°) - fix*) + Ak 


where 



y/(xO) - fix*) + 



Although uglier than the expression depending on Li, this expression will tend to be smaller unless is not 
small. 


Appendix H Convergence Analysis of GS-s, GS-r, and GS-q Rules 

In this section, we consider problems of the form 

n 

min Fix) = fix)+gix) = /(x) + V 

tcGiR 

2=1 

where / satishes our usual assumptions, but the gi can be non-smooth. We first introduce some notation 
that will be needed to state our result for the GS-q rule, followed by stating the result and then showing 
that it holds in two parts. We then turn to showing that the rule cannot hold in general for the GS-s and 
GS-r rules. 
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H.l Notation and basic ineqnality 


To analyze this case, an important inequality we will use is that the L-Lipschitz-continuity of V^/ implies 
that for all x, i, and d that 

F{x + dci) = f{x + dci) + g(x + dei) < f{x) + (V/(x), det) + ^d^ + g{x + dci) 

= fix) + gix) + {V fix), da) + ^d^ + + d)- g,{xi) 

= F{x) + Vi{x,d), 

where ^ 

Vi{x,d) = {Vfix),dei) + + gtixi + d) - gi{xi). 

Notice that the GS-q rule is defined by 


ik = argminjmin Vi(a;, d)}. 

i d 


We use the notation d’) = argmin^ d) and we will use d^ to denote the vector containing these values 
for all i. When using the GS-g rule, the iteration is defined by 


= x'" + di^^ei,^ 

= x'^ + argmin{bi^ {x, d)}ei^ 

d 


(28) 


In this notation the GS-r rule is given by 

jk = argmax |df |. 

i 


We will use the notation x'f to be the step that would be taken at Xk if we update coordinate jk according 
the GS-r rule 


xX=x’^ + 

From the optimality of d^, we have for any i that 


- L[(xf - - {x'j + d'j)] e dg.ix>j + df), (29) 

and we will use the notation Sj for the unique element of dgjixj -I- dj) satisfying this relationship. We use 
s* to denote the vector containing these values. 


H.2 Convergence bonnd for GS-g rnle 

Under this notation, we can show that coordinate descent with the GS-g rule satisfies the bound 
F{x'^+^) - F{x*) < min | (^1 - [/(a:'=) - f{x*)], (l - [/(a;'=) - f{x*)] + e^j , 

where 


efc < f (d(4) - gix'^ + ix'^ + d^) - 4)) 

We note that if g is linear then e* = 0 and this convergence rate reduces to 


F{x'^+^) - Fix*) < ( 1 - 4 


Fix’^) - Fix*) 


(30) 


Otherwise, depends on how far g(x^) lies above a particular linear underestimate extending from ix^+df), 
as well as the conditioning of /. We show this result by first showing that the GS-g rule makes at least 
as much progress as randomized selection (first part of the min), and then showing that the GS-g rule also 
makes at least as much progress as the GS-r rule (second part of the min). 
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H.3 GS-g is at least as fast as random 


Our argument in this section follows a similar approach to Richtarik and Takac 2014 . In particular, 
combining (271 and (28) we have the following upper bound on the iteration progress 

< FIccM + min i min d) 1 , 

iG{l,2,....n} l^dGR J 

= F{x^)+ min | min y, - ) I , 

= F(x'^) + min < min Vi(x'^,yi — xf)>, 

yGK" l^iG{1.2,...,n} J 

< F{x'^) + min I - V V,{x\ y, - x>^) \ 

= + i niin I(V/(x'=),y - x'^) + ^\\y - + g(y) - g(x'=) 


n yGlR" 


1 


1 




= 1 _ _ \F{x'^) + - min /(x'=) + {Vf{x% y - x'=) + -||y - x'^^ + giv) ■ 


nj ' ' n 1/GiR" 

From strong convexity of /, we have that F is also /r-strongly convex and that 

< f{y) - (V/(x'=),y - x'^)) x%\ 


F{ax* + (1 — a)x^) < aF{x*) + (1 — a)F{x^) — 


fc a{l-a)y 




for any y S H" and any a € [0,1] [see Nesterov 2004 Theorem 2.1.9]. Using these gives us 


1 


1 


1 


n i/GK" 

1 


T, 


< 1-- F(x ) + - min <^/(y)--||y-x|| +-||y-x|| +y(y) 


= 1-F(x ) + - min F{y) H-^||y - x 


1 


n yGH 

1 


L — y, 


2 

fc ||2 


< 1- \F{x^) H— min ■[ F{ax* + (1 — a)x^) + 


1 


n ctG[o,i] 

1 


< 1- \F{x^) H— min i aF{x*) + (1 — a)F{x^) + 


sli-On 


n o!G[o,i] 


1 


a^{L - y)-a{l- a)y ^ ^ 


\x — X 


*F{x*)F{l-a*)F{x^) 


^choosing a* = ^ g (0,1]^ 


= 1- - F(x'=) + —U(x*) + 




a 


= F{x'^) - —[Fix'") - Fix*)]. 

Subtracting F{x*) from both sides of this inequality gives us 


F(x*^+i) - F{x*) < ( 1 - ^ ) [F(x'=) - Fix- 
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H.4 GS-g is at least as fast as GS-r 


In this section we derive the right side of the bound (30) for the GS-r rule, but note it also applies to the 
GS-g rule because from (p^ and (pSl) we have 


F{x^+^) < F{x'^)+m.\nVi{x,d’l) (GS-g rule) 

i 

< F{x^) + Vj^{x,dj^) {jk selected by the GS-r rule). 

Note that we lose progress by considering a bound based on the GS-r rule, but its connection to the oo-norm 
will make it easier to derive an upper bound. 

By the convexity of we have 

9j, {x'},) > gj, + 4) + + 4)) 

= 5.. (4 + 4) - (-^4 - V,./(a.'=))(4) 

= 5.. (4 + 4) + v,./(^'=)4 + ^(4)^ 

where sf is defined by ( [2^ . Using this we have that 
Fix'^+^) < Fix’^) + V,{x,d’;j 

= F{x^) + V,/(x'=)(4) + |(4)^ + g.(4 + 4) - g,(4) 

< u4) + v,/4)(4) + ^{dir - v,,/4)4 - l(4)2 


= u(4-f(44 


Adding and subtracting F{x*) and noting that jk is selected using the GS-r rule, we obtain the upper bound 


u4+i) - F{x*) < f4) - F(x*) --\\d' 


fc ||2 

OO ' 


(31) 


Recall that we use to denote the iteration that would result if we chose jk and actually performed the 
GS-r update. Using the Lipschitz continuity of the gradient and definition of the GS-g rule again, we have 

L. 


F(x''+p < F(x'=) + Vfix^y (x'=+^ - xy + 


- X 


k \\2 


+ g{x'^+y - g{xy 


< F{xy + Vf{xy {x% - xy + -\\x%-xy\^+ g{xt) - g(x'=) 
= f{xy F Vf{xy^{xX - xy + 4|d'=4 + g{xX) 


By the strong-convexity of /, for any y G IR^ we have 


f(x^) < f{y) - V/(x'=)^(y - xy - y 111/ - 


and using this we obtain 


F{x'^+y < f{y) + v/(x'=)^(x^ 

By the convexity of g and G dg{x^ + d^), we have 

g{y)>9{x^ + dy + {syy-{x'^ + dy). 


■54 


(32) 
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Combining (32) with the above inequality, we have 


- F{y) < (V/(x'=),4 -y)-^\\y- + ^Wd'^Wl 

+ 5(4) - g{x'^ + d^) + (s^ {x’^ + d^) - y). 


We add and subtract on the right-hand side to get 


F{x^+^) - F{y) < (V/(a:'=) + - y) - f^\\y - + ^\\d 


L. 


jfcip 

I OO 


+ 5(4) - g{x^ + d'=) + (s^ {x'^ + d^) - 4). 

Let = g{x^) — g{x^ + d^) + (s^, (a;* + d^) — ), which is non-negative by the convexity g. Making this 

substitution, we have 


F{y) > F(x'=+i) + (-Ld^5 - x^) + ^\\y - - -|irf 


II,. .„feii2 ^ II jfe||2 _ 


Now add and subtract {—Ld^,x^) to the right-hand side and use (29) to get 


F{y) > F{x^+^) + {-Ld\y- 4 + y ^ " A\l - 2 IMllL - - 4) " 

Minimizing both sides with respect to y results in 

F{x*) > F(x'=+i) - ^y\\L -\\\d^\\l. - L{d\x'^ - 4) - 


= F(a;'=+^) - 


fe+i, L{L 5i)||,jfcii2 


2gi 


-c , 


where we have used that x’^ = x^ F d^ and \d^ \ = IM^Hoo- Combining this with equation (31), we get 


L. 


j/c||2 


F{x'^+^) - Fix*) < F{x^) - Fix*) - 

F(a;'=+i) - Fix*) < F(a;'=) - Fix*) - , , 

{L- di) 


Fix^+^) - Fix*) - 


1-h 


Ml 


(■^^-Mi) 


Fix’^^^) - Fix*) 


< Fix^) - Fix*) F ^ 


{L- Fi) 


Fix^+^) - Fix*) < 


{L- Fi) 


Fix'^*^^) - Fix*) < ( 1- 4 


Fix^) - Fix*) 
Fix^) - Fix*) 


„kd-i 


kFi 

Fc^^. 

Ij 


H.5 Lack of progress of the GS-s rule 

We now show that the rate (1 — gi/L), and even the slower rate (1 — g/Ln), cannot hold for the GS-s 
rule. We do this by constructing a problem where an iteration of the GS-s method does not make sufficient 
progress. In particular, consider the bound-constrained problem 

x£C Z 
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where C = {x : a; > 0}, and 



We thus have that 

/(x°) = ^((1 + 1)2 + (.07 + 3)2)«6.7 

/(^*) = ^((-lf+ (-3r)=5 

Vfix°)=A'^iAxo-b)^ (^21) 

The parameter values for this problem are 


n = 2 

M “ ^niin ~ 0-49 

T — ^max — 1 



1 + 


0.49 


0.33, 



where the Xi are the eigenvalues of A^A, and /i and /ii are the corresponding strong-convexity constants for 
the 2 -norm and 1 -norm, respectively. 

The proximal operator of the indicator function is the projection onto the set C, which involves setting 
negative elements to zero. Thus, our iteration update is given by 

= prox[a:'= - jV^J{x'^)ei^] = max{x'^ - jV^J{x’^)ei^,0), 

Sc ^ ^ 


For this problem, the GS-s rule is given by 

i = argmax |? 7 f |, 

i 


where 



V./(a;'=), 

0 , 


if 4 7 ^ 0 or Vi/(x'=) < 0 
otherwise 


Based on the value of V/(a;°), the GS-s rule thus chooses to update coordinate 2, setting it to zero and 
obtaining 

f {x^) = \{{l +If+ {-?>?) = 

Thus we have 

f{xf - fix*) _ 6.5-5 _ 
even though the bounds obtain the faster rates of 



(1 - « (1 -0.33) = 0.67. 
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Thus, the GS-s rule does not satisfy either bound. On the other hand, the GS-r and GS-g rules are given in 
this context by 


Ik = argmax 


max ( — —Vi/(x^)ei, 0 — x" 

1j 


and thus both these rules choose to update coordinate 1, setting it to 
progress ratio of 


f{x^)-f{x*) _ 5.2-5 
fixO)-f{x*) ^6.7-5 


zero to obtain f{x^) 


5.2 and a 


which clearly satisfies both bounds. 


H.6 Lack of progress of the GS-r rule 

We now turn to showing that the GS-r rule does not satisfy these bounds in general. It will not be possible 
to show this for a simple bound-constrained problem since the GS-r and GS-g rules are equivalent for these 
problems. Thus, we consider the following -regularized problem 


min A||a:||i = F{x). 

2 

We use the same A as the previous section, so that n, /i, L, and /xi are the same. However, we now take 


b = 





A=l, 


so we have 


/(a:o)«3.1, /(x*) = 2 


The proximal operator of the absolute value function is given by the soft-threshold function, and our coor¬ 
dinate update of variable ik is given by 

= Prox[a;y"] = sgn(yy) • max(4y - A/L,0), 


A|.| 


where we have used the notation 


The GS-r rule is defined by 




ik = argmax jd 


fc, 

t h 


where = prox;^|.| — x^ and in this case 


d" = 


0.6 

-0.5/ ■ 


Thus, the GS-r rule chooses to update coordinate 1. After this update the function value is 

Fix^) « 2.9, 


so the progress ratio is 


F{x^)-F{x*) 2.9-2 

F{xO)-F{x*) 3.1-2 
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Running time for I 2 - regularized sparse least squares 



Figure 3: Comparison of coordinate selection rules for £ 2 -regularized sparse least squares. 


However, the bounds suggest faster progress ratios of 



« 0.76, 



so the GS-r rule does not satisfy either bound. In contrast, in this setting the GS-q rule chooses to update 
coordinate 2 and obtains F{x^) « 2.2, obtaining a progress ratio of 

F{x^)-F{x*) ^ 2.2-2 ^ 

F{x°)-F{x*) 3.1-2 ■ ’ 


which satisfies both bounds by a substantial margin. Indeed, we used a genetic algorithm to search for a 
setting of the parameters of this problem (values of x^, A, b, and the diagonals of A) that would make the 
GS-q not satisfy the bound depending on /ii, and it easily found counter-examples for the GS-s and GS-r 
rules but was not able to produce a counter example for the GS-q rule. 


Appendix I Runtime Experiments 

In Figure we plot the objective against the runtime for the ^ 2 -regularized sparse least squares problem 
from our experiments. Although runtimes are very sensitive to exact implementation details and we believe 
that more clever implementations than our naive Python script are possible, this figure does show that the 
GS and GSL rules offer benefits in terms of runtime with our implementation and test hardware. 
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